Clustering is a grouping technique (or algorithm) used in unsupervised learning. Unlike classification, clustering does not rely on data with known labels; instead, data points are grouped by calculating the degree of similarity between them.
Source: Abla Chouni Benabdellah, Asmaa Benghabrit, Imane Bouhaddou, "A Survey of Clustering Algorithms for an Industrial Context," Procedia Computer Science, Volume 148, 2019, Pages 291-302, ISSN 1877-0509.
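In practice, the similarity mentioned above is usually expressed through a distance metric such as the Euclidean distance: points that lie close together are considered similar and tend to end up in the same cluster. A minimal sketch with made-up points:

```python
# Pairwise Euclidean distances between a few made-up 2-D points:
# small distance = high similarity, so the first two points would
# naturally fall into the same cluster and the third into another.
import numpy as np

points = np.array([[1.0, 2.0],
                   [1.5, 1.8],
                   [8.0, 8.0]])

# d(a, b) = sqrt(sum((a - b)^2)), computed for every pair at once
diff = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))
print(distances.round(2))
```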
Dataset:
import pandas as pd
df = pd.read_csv('cluster.csv',parse_dates=['trans_date'])
df
| | customer_id | trans_date | tran_amount |
|---|---|---|---|
| 0 | CS5295 | 2013-02-11 | 35 |
| 1 | CS4768 | 2015-03-15 | 39 |
| 2 | CS2122 | 2013-02-26 | 52 |
| 3 | CS1217 | 2011-11-16 | 99 |
| 4 | CS1850 | 2013-11-20 | 78 |
| ... | ... | ... | ... |
| 124995 | CS8433 | 2011-06-26 | 64 |
| 124996 | CS7232 | 2014-08-19 | 38 |
| 124997 | CS8731 | 2014-11-28 | 42 |
| 124998 | CS8133 | 2013-12-14 | 13 |
| 124999 | CS7996 | 2014-12-13 | 36 |
125000 rows × 3 columns
#print all transactions for one sample customer
print(df[df['customer_id']=='CS4657'])
print(df['customer_id'][df['customer_id']=='CS4657'].value_counts())
      customer_id trans_date  tran_amount
2104       CS4657 2011-06-24           53
2302       CS4657 2012-06-06           73
2979       CS4657 2012-08-17           81
4133       CS4657 2014-05-30           42
4343       CS4657 2015-03-16          100
33260      CS4657 2012-12-02           75
35775      CS4657 2013-07-22           98
42793      CS4657 2012-09-21          101
46591      CS4657 2012-02-17           63
66684      CS4657 2015-03-03           90
80804      CS4657 2011-09-17           81
80838      CS4657 2012-01-19          105
81325      CS4657 2012-02-04           46
87537      CS4657 2013-01-24           43
91174      CS4657 2012-08-31           80
94166      CS4657 2012-11-12           80
94940      CS4657 2014-03-17           46
CS4657    17
Name: customer_id, dtype: int64
df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125000 entries, 0 to 124999
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   customer_id  125000 non-null  object
 1   trans_date   125000 non-null  datetime64[ns]
 2   tran_amount  125000 non-null  int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 2.9+ MB
A. EDA
#grouping data based on newest transaction
df_recency=df.groupby('customer_id', as_index=False)['trans_date'].max()
df_recency
| | customer_id | trans_date |
|---|---|---|
| 0 | CS1112 | 2015-01-14 |
| 1 | CS1113 | 2015-02-09 |
| 2 | CS1114 | 2015-02-12 |
| 3 | CS1115 | 2015-03-05 |
| 4 | CS1116 | 2014-08-25 |
| ... | ... | ... |
| 6884 | CS8996 | 2014-12-09 |
| 6885 | CS8997 | 2014-06-28 |
| 6886 | CS8998 | 2014-12-22 |
| 6887 | CS8999 | 2014-07-02 |
| 6888 | CS9000 | 2015-02-28 |
6889 rows × 2 columns
#calculate recency for each customer
df_recency['recency']=df_recency['trans_date'].apply(lambda x: df_recency['trans_date'].max()-x).dt.days
df_recency=df_recency.drop(['trans_date'], axis=1)
df_recency.head()
| | customer_id | recency |
|---|---|---|
| 0 | CS1112 | 61 |
| 1 | CS1113 | 35 |
| 2 | CS1114 | 32 |
| 3 | CS1115 | 11 |
| 4 | CS1116 | 203 |
df_recency['recency'].describe()
count    6889.000000
mean       80.538249
std        85.382526
min         0.000000
25%        22.000000
50%        53.000000
75%       111.000000
max       857.000000
Name: recency, dtype: float64
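The per-customer recency computed above with `apply` can also be obtained in a fully vectorized form: subtracting a datetime Series from its maximum yields a timedelta Series directly. A small sketch with made-up dates:

```python
# Vectorized recency: days between each customer's last transaction and
# the most recent transaction in the whole dataset. The dates below are
# a small made-up sample.
import pandas as pd

dates = pd.Series(pd.to_datetime(['2015-01-14', '2015-02-09', '2015-03-16']))
recency = (dates.max() - dates).dt.days
print(recency.tolist())
```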
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(18,7))
g1=sns.histplot(x='recency',
data=df_recency,
color='y')
g1.set_xlabel('Recency')
g1.set_ylabel('Counts')
g1.set_title('Recency Distribution')
#create data with feature frequency
df_frequency=df.groupby('customer_id', as_index=False)['trans_date'].count()
df_frequency=df_frequency.rename(columns={'trans_date':'frequency'})
df_frequency
| | customer_id | frequency |
|---|---|---|
| 0 | CS1112 | 15 |
| 1 | CS1113 | 20 |
| 2 | CS1114 | 19 |
| 3 | CS1115 | 22 |
| 4 | CS1116 | 13 |
| ... | ... | ... |
| 6884 | CS8996 | 13 |
| 6885 | CS8997 | 14 |
| 6886 | CS8998 | 13 |
| 6887 | CS8999 | 12 |
| 6888 | CS9000 | 13 |
6889 rows × 2 columns
df_frequency['frequency'].describe()
count    6889.000000
mean       18.144869
std         5.193014
min         4.000000
25%        14.000000
50%        18.000000
75%        22.000000
max        39.000000
Name: frequency, dtype: float64
plt.figure(figsize=(18,7))
p4=sns.countplot(x='frequency',
data=df_frequency,
color='y')
p4.set_xlabel('Frequency')
p4.set_ylabel('Counts')
p4.set_title('Frequency Distribution')
plt.show()
#create data with feature monetary
df_monetary=df.groupby(['customer_id'], as_index=False)['tran_amount'].sum()
df_monetary=df_monetary.rename(columns={'tran_amount':'monetary'})
df_monetary.head()
| | customer_id | monetary |
|---|---|---|
| 0 | CS1112 | 1012 |
| 1 | CS1113 | 1490 |
| 2 | CS1114 | 1432 |
| 3 | CS1115 | 1659 |
| 4 | CS1116 | 857 |
df_monetary['monetary'].describe()
count    6889.000000
mean     1179.269705
std       465.832609
min       149.000000
25%       781.000000
50%      1227.000000
75%      1520.000000
max      2933.000000
Name: monetary, dtype: float64
plt.figure(figsize=(18,7))
p7=sns.histplot(x='monetary',
data=df_monetary,
color='y')
p7.set_xlabel('Monetary')
p7.set_ylabel('Counts')
p7.set_title('Monetary Distribution')
plt.show()
B. Clustering for Market Segmentation with RFM Analysis
#join the recency, frequency, and monetary features into one dataframe
df_final=pd.merge(df_recency, df_frequency, on='customer_id')
df_final=pd.merge(df_final,df_monetary, on='customer_id')
df_final=df_final.reset_index(drop=True)
df_final.head(10)
| | customer_id | recency | frequency | monetary |
|---|---|---|---|---|
| 0 | CS1112 | 61 | 15 | 1012 |
| 1 | CS1113 | 35 | 20 | 1490 |
| 2 | CS1114 | 32 | 19 | 1432 |
| 3 | CS1115 | 11 | 22 | 1659 |
| 4 | CS1116 | 203 | 13 | 857 |
| 5 | CS1117 | 257 | 17 | 1185 |
| 6 | CS1118 | 2 | 15 | 1011 |
| 7 | CS1119 | 11 | 15 | 1158 |
| 8 | CS1120 | 10 | 24 | 1677 |
| 9 | CS1121 | 41 | 26 | 1524 |

#with z-score
from sklearn.preprocessing import StandardScaler
df_numerical=df_final[['recency','frequency','monetary']]
sc=StandardScaler()
df_numerical=sc.fit_transform(df_numerical)
print(df_numerical)
df_final[['recency','frequency','monetary']]=df_numerical
df_final
[[-0.22884855 -0.60563997 -0.35910291]
 [-0.53338262  0.35726189  0.66709123]
 [-0.56852116  0.16468152  0.54257395]
 ...
 [ 0.04054696 -0.99080071 -1.19208059]
 [ 2.06686975 -1.18338108 -1.70947136]
 [-0.75592674 -0.99080071 -1.38744391]]
| | customer_id | recency | frequency | monetary |
|---|---|---|---|---|
| 0 | CS1112 | -0.228849 | -0.605640 | -0.359103 |
| 1 | CS1113 | -0.533383 | 0.357262 | 0.667091 |
| 2 | CS1114 | -0.568521 | 0.164682 | 0.542574 |
| 3 | CS1115 | -0.814491 | 0.742423 | 1.029909 |
| 4 | CS1116 | 1.434376 | -0.990801 | -0.691865 |
| ... | ... | ... | ... | ... |
| 6884 | CS8996 | 0.192814 | -0.990801 | -1.282248 |
| 6885 | CS8997 | 2.113721 | -0.798220 | -1.365975 |
| 6886 | CS8998 | 0.040547 | -0.990801 | -1.192081 |
| 6887 | CS8999 | 2.066870 | -1.183381 | -1.709471 |
| 6888 | CS9000 | -0.755927 | -0.990801 | -1.387444 |
6889 rows × 4 columns
#sigmoid-style transform, as implemented: g(z)=1/(1-e^(-z))
#note: the standard logistic sigmoid is g(z)=1/(1+e^(-z)); with the minus sign
#this variant is unbounded near z=0, which explains the large values below
import math  #math is part of the standard library; no pip install needed
df_final_sigmoid=df_final.copy()
col_rfm=[col for col in df_final_sigmoid.columns if col != 'customer_id']
for col in col_rfm:
    df_final_sigmoid[col]=df_final_sigmoid[col].apply(lambda x: 1/(1-(math.e**(-x))))
df_final_sigmoid
| | customer_id | recency | frequency | monetary |
|---|---|---|---|---|
| 0 | CS1112 | -3.888756 | -1.201310 | -2.314578 |
| 1 | CS1113 | -1.419066 | 3.328776 | 2.054228 |
| 2 | CS1114 | -1.306073 | 6.586044 | 2.388061 |
| 3 | CS1115 | -0.794896 | 1.908249 | 1.555306 |
| 4 | CS1116 | 1.312791 | -0.590531 | -1.002570 |
| ... | ... | ... | ... | ... |
| 6884 | CS8996 | 5.702404 | -0.590531 | -0.383916 |
| 6885 | CS8997 | 1.137382 | -0.818609 | -0.342519 |
| 6886 | CS8998 | 25.166140 | -0.590531 | -0.435934 |
| 6887 | CS8999 | 1.144926 | -0.441424 | -0.220944 |
| 6888 | CS9000 | -0.885282 | -0.590531 | -0.332823 |
6889 rows × 4 columns
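For comparison, the standard logistic sigmoid has a plus sign in the denominator, g(z) = 1/(1 + e^(-z)), and maps any input into (0, 1). The variant used above, with a minus sign, diverges near z = 0, which is why the transformed table contains values like 25.17. A short sketch of the difference:

```python
# Contrast the standard logistic sigmoid with the minus-sign variant used
# in the transform above. Near z = 0 the logistic stays near 0.5 while the
# variant blows up (divides by a number close to zero).
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))  # standard sigmoid, bounded in (0, 1)

def variant(z):
    return 1.0 / (1.0 - math.exp(-z))  # variant used above, unbounded near 0

print(logistic(0.040547), variant(0.040547))
```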
def plt_boxplot(data):
    fig, ax = plt.subplots(figsize=(18,7), nrows=3)
    plot1=sns.boxplot(x='recency',
                      y='frequency',
                      data=data,
                      ax=ax[0],
                      palette='husl')
    plot1.set_xlabel('recency')
    plot1.set_ylabel('frequency')
    plot1.set_title('Recency vs Frequency')
    plot2=sns.boxplot(x='recency',
                      y='monetary',
                      data=data,
                      ax=ax[1],
                      palette='husl')
    plot2.set_xlabel('recency')
    plot2.set_ylabel('monetary')
    plot2.set_title('Recency vs Monetary')
    plot3=sns.boxplot(x='frequency',
                      y='monetary',
                      data=data,
                      ax=ax[2],
                      palette='husl')
    plot3.set_xlabel('frequency')
    plot3.set_ylabel('monetary')
    plot3.set_title('Frequency vs Monetary')
    plt.tight_layout()
    plt.show()
def plot_scatter(data):
    column_num=[col for col in data.columns if col != 'customer_id']
    for col in column_num:
        #plt.figure(figsize=(18,7))
        g1=sns.scatterplot(x='customer_id',
                           y=col,
                           data=data,
                           color='b')
        g1.set_xlabel('Customer ID')
        g1.set_ylabel(col)
        g1.set_title(col)
        plt.show()
plt_boxplot(df_final)
plot_scatter(df_final)
sns.pairplot(df_final[['recency','frequency','monetary']])
plt_boxplot(df_final_sigmoid)
sns.pairplot(df_final_sigmoid[['recency','frequency','monetary']])
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
import numpy as np
def Kmeans(n_clusters, df):
    data=df[['recency','frequency','monetary']].values
    kmeans=KMeans(n_clusters=n_clusters, init='random', random_state=0)
    kmeans.fit(data)
    df['Cluster']=kmeans.labels_
    return df, kmeans.labels_
def Kmedoids(n_clusters, df):
    data=df[['recency','frequency','monetary']].values
    kmedoids=KMedoids(n_clusters=n_clusters, random_state=0)
    kmedoids.fit(data)
    df['Cluster']=kmedoids.labels_
    return df, kmedoids.labels_
def Kmeans_plus(n_clusters, df):
    data=df[['recency','frequency','monetary']].values
    kmeans_plus=KMeans(n_clusters=n_clusters, init='k-means++', random_state=0)
    kmeans_plus.fit(data)
    df['Cluster']=kmeans_plus.labels_
    return df, kmeans_plus.labels_
from mpl_toolkits.mplot3d import Axes3D
def plot_scatter_3D(labels, df):
    fig = plt.figure(figsize=(18,7))
    ax = fig.add_subplot(projection='3d')  #replaces deprecated Axes3D(fig)
    for label in labels:
        df_plot=df[df['Cluster']==label]
        x=df_plot['recency'].values
        y=df_plot['frequency'].values
        z=df_plot['monetary'].values
        ax.scatter(x, y, z, label=label)
    ax.set_xlabel('Recency')
    ax.set_ylabel('Frequency')
    ax.set_zlabel('Monetary')
    #ax.view_init(azim=45)
    ax.legend()
    print('after clustering plotting with k =', len(labels))
    plt.show()
def elbow_method_kmeans(df):
    wccs_list=[]  #WCSS: Within-Cluster Sum of Squares
    k_list=[]
    silhouette_kmeans=[]
    data=df[['recency','frequency','monetary']].values
    for k in range(2,11):
        kmeans=KMeans(n_clusters=k, init='random', random_state=0)
        kmeans.fit(data)
        silhouette_value=silhouette_score(data, kmeans.labels_)
        wccs=kmeans.inertia_
        k_list.append(k)
        silhouette_kmeans.append(silhouette_value)
        wccs_list.append(wccs)
    return wccs_list, k_list, silhouette_kmeans
def elbow_method_kmedoids(df):
    wccs_list=[]  #WCSS: Within-Cluster Sum of Squares
    k_list=[]
    silhouette_kmedoids=[]
    data=df[['recency','frequency','monetary']].values
    for k in range(2,11):
        kmedoids=KMedoids(n_clusters=k, random_state=0)
        kmedoids.fit(data)
        silhouette_value=silhouette_score(data, kmedoids.labels_)
        wccs=kmedoids.inertia_
        k_list.append(k)
        silhouette_kmedoids.append(silhouette_value)
        wccs_list.append(wccs)
    return wccs_list, k_list, silhouette_kmedoids
def elbow_method_kmeans_plus(df):
    wccs_list=[]  #WCSS: Within-Cluster Sum of Squares
    k_list=[]
    silhouette_kmeans_plus=[]
    data=df[['recency','frequency','monetary']].values
    for k in range(2,11):
        kmeans_plus=KMeans(n_clusters=k, init="k-means++", random_state=0)
        kmeans_plus.fit(data)
        silhouette_value=silhouette_score(data, kmeans_plus.labels_)
        wccs=kmeans_plus.inertia_
        k_list.append(k)
        silhouette_kmeans_plus.append(silhouette_value)
        wccs_list.append(wccs)
    return wccs_list, k_list, silhouette_kmeans_plus
The K-Means algorithm is a clustering algorithm introduced by J. B. MacQueen. To obtain an optimal result, it works by searching for a partition into K groups that satisfies certain criteria.
Source: Youguo Li, Haiyan Wu, "A Clustering Method Based on K-Means Algorithm," Physics Procedia, Volume 25, 2012, Pages 1104-1109, ISSN 1875-3892.
K-Means groups data points by their nearest distance, measured with the Euclidean distance. Cluster centers are chosen as the means of randomly selected partitions of the dataset, and each data point joins the cluster whose center is closest to it.
Source: Preeti Arora, Deepali, Shipra Varshney, "Analysis of K-Means and K-Medoids Algorithm For Big Data," Procedia Computer Science, Volume 78, 2016, Pages 507-512, ISSN 1877-0509.
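The assignment-then-update cycle described above can be sketched in a few lines; the points and centers below are made up for illustration:

```python
# One K-Means iteration under Euclidean distance: assign each point to the
# nearest current center, then recompute each center as the mean of its
# members. Points and initial centers are made up.
import numpy as np

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])

# distance from every point to every center, then argmin over centers
d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
labels = d.argmin(axis=1)

# update step: each center becomes the mean of the points assigned to it
new_centers = np.array([points[labels == j].mean(axis=0)
                        for j in range(len(centers))])
print(labels, new_centers)
```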
df_final_kmeans_k2=Kmeans(2, df_final)
df_final_kmeans_k2
(     customer_id   recency  frequency  monetary  Cluster
 0         CS1112 -0.228849  -0.605640 -0.359103        0
 1         CS1113 -0.533383   0.357262  0.667091        1
 2         CS1114 -0.568521   0.164682  0.542574        1
 3         CS1115 -0.814491   0.742423  1.029909        1
 4         CS1116  1.434376  -0.990801 -0.691865        0
 ...          ...       ...        ...       ...      ...
 6884      CS8996  0.192814  -0.990801 -1.282248        0
 6885      CS8997  2.113721  -0.798220 -1.365975        0
 6886      CS8998  0.040547  -0.990801 -1.192081        0
 6887      CS8999  2.066870  -1.183381 -1.709471        0
 6888      CS9000 -0.755927  -0.990801 -1.387444        0

 [6889 rows x 5 columns], array([0, 1, 1, ..., 0, 0, 0]))
labels=[0,1]
plot_scatter_3D(labels, df_final)
after clustering plotting with k = 2
#another plot with sns.pairplot (optional)
sns.pairplot(df_final, vars=['recency','frequency','monetary'],
hue='Cluster', palette='bright')
plt.show()
df_final_kmeans_k6_sigmoid=Kmeans(6, df_final_sigmoid)
df_final_kmeans_k6_sigmoid
(     customer_id    recency  frequency  monetary  Cluster
 0         CS1112  -3.888756  -1.201310 -2.314578        4
 1         CS1113  -1.419066   3.328776  2.054228        4
 2         CS1114  -1.306073   6.586044  2.388061        4
 3         CS1115  -0.794896   1.908249  1.555306        4
 4         CS1116   1.312791  -0.590531 -1.002570        4
 ...          ...        ...        ...       ...      ...
 6884      CS8996   5.702404  -0.590531 -0.383916        4
 6885      CS8997   1.137382  -0.818609 -0.342519        4
 6886      CS8998  25.166140  -0.590531 -0.435934        4
 6887      CS8999   1.144926  -0.441424 -0.220944        4
 6888      CS9000  -0.885282  -0.590531 -0.332823        4

 [6889 rows x 5 columns], array([4, 4, 4, ..., 4, 4, 4]))
labels=[0,1,2,3,4,5]
plot_scatter_3D(labels, df_final_sigmoid)
after clustering plotting with k = 6
wccs_kmeans, k_list_kmeans, silhouette_kmeans=elbow_method_kmeans(df_final)
wccs_kmeans=list(wccs_kmeans)
k_list_kmeans=list(k_list_kmeans)
#wccs result
for i,k in enumerate(k_list_kmeans):
    print('k :', k, 'wccs :', wccs_kmeans[i])
k : 2 wccs : 11182.21120278527
k : 3 wccs : 7874.955047042264
k : 4 wccs : 5925.50177070872
k : 5 wccs : 4918.334811117619
k : 6 wccs : 4213.467162911648
k : 7 wccs : 3593.6821891578747
k : 8 wccs : 3266.5160994863127
k : 9 wccs : 2995.9287341922886
k : 10 wccs : 2711.864765807739
g1=sns.lineplot(x=k_list_kmeans,
y=wccs_kmeans)
g1.set_xlabel('K K-Means')
g1.set_ylabel('WCCS K-Means')
g1.set_title('K vs WCCS K-Means')
Based on the elbow-method plot of the inertia, the optimal number of clusters is 4: k = 4 is where the visible bend (the "elbow") occurs in the curve (normalized data).
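The elbow can also be located programmatically. One simple heuristic (a rough sketch, not the only option) takes the k where the second difference of the WCSS curve, a proxy for its change of slope, is largest. The WCSS values below are made up to mimic a typical elbow curve:

```python
# Rough elbow detection: the k at which the WCSS curve bends most sharply,
# estimated via the largest second difference. The wcss values are made up.
import numpy as np

k_values = list(range(2, 8))
wcss = [100.0, 70.0, 45.0, 40.0, 37.0, 35.0]

second_diff = np.diff(wcss, n=2)              # curvature proxy at k_values[1:-1]
elbow_k = k_values[1 + int(np.argmax(second_diff))]
print(elbow_k)
```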
# Clustering with K-Means by using k-optimal
df_final_kmeans_k_optimal=Kmeans(4, df_final)
df_final_kmeans_k_optimal
(     customer_id   recency  frequency  monetary  Cluster
 0         CS1112 -0.228849  -0.605640 -0.359103        2
 1         CS1113 -0.533383   0.357262  0.667091        2
 2         CS1114 -0.568521   0.164682  0.542574        2
 3         CS1115 -0.814491   0.742423  1.029909        3
 4         CS1116  1.434376  -0.990801 -0.691865        1
 ...          ...       ...        ...       ...      ...
 6884      CS8996  0.192814  -0.990801 -1.282248        0
 6885      CS8997  2.113721  -0.798220 -1.365975        1
 6886      CS8998  0.040547  -0.990801 -1.192081        0
 6887      CS8999  2.066870  -1.183381 -1.709471        1
 6888      CS9000 -0.755927  -0.990801 -1.387444        0

 [6889 rows x 5 columns], array([2, 2, 2, ..., 0, 1, 0]))
labels=[0,1,2,3]
plot_scatter_3D(labels, df_final)
after clustering plotting with k = 4
#another plot with sns.pairplot (optional)
sns.pairplot(df_final, vars=['recency','frequency','monetary'],
hue='Cluster', palette='bright')
plt.show()
K-Medoids is a clustering model whose core idea is to find a medoid in each cluster: an actual data point that serves as the cluster's center. The method searches for k representative objects that minimize the sum of dissimilarities between the data objects and their representatives.
Source: Preeti Arora, Deepali, Shipra Varshney, "Analysis of K-Means and K-Medoids Algorithm For Big Data," Procedia Computer Science, Volume 78, 2016, Pages 507-512, ISSN 1877-0509.
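The medoid idea can be sketched directly: within a cluster, the medoid is the member whose total distance to all other members is smallest, and, unlike a K-Means centroid, it is always a real data point. The points below are made up:

```python
# Finding the medoid of one made-up cluster: the member that minimizes the
# sum of Euclidean distances to all other members.
import numpy as np

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [6.0, 6.0]])
d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
medoid_index = d.sum(axis=1).argmin()
print(medoid_index, cluster[medoid_index])
```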
df_final_kmedoids_k4=Kmedoids(4, df_final)
df_final_kmedoids_k4
(     customer_id   recency  frequency  monetary  Cluster
 0         CS1112 -0.228849  -0.605640 -0.359103        3
 1         CS1113 -0.533383   0.357262  0.667091        3
 2         CS1114 -0.568521   0.164682  0.542574        3
 3         CS1115 -0.814491   0.742423  1.029909        1
 4         CS1116  1.434376  -0.990801 -0.691865        0
 ...          ...       ...        ...       ...      ...
 6884      CS8996  0.192814  -0.990801 -1.282248        2
 6885      CS8997  2.113721  -0.798220 -1.365975        0
 6886      CS8998  0.040547  -0.990801 -1.192081        2
 6887      CS8999  2.066870  -1.183381 -1.709471        0
 6888      CS9000 -0.755927  -0.990801 -1.387444        2

 [6889 rows x 5 columns], array([3, 3, 3, ..., 2, 0, 2], dtype=int64))
labels=[0,1,2,3]
plot_scatter_3D(labels, df_final)
after clustering plotting with k = 4
#another plot with sns.pairplot (optional)
sns.pairplot(df_final, vars=['recency','frequency','monetary'],
hue='Cluster', palette='bright')
plt.show()
wccs_kmedoids, k_list_kmedoids, silhouette_kmedoids=elbow_method_kmedoids(df_final)
silhouette_kmedoids=list(silhouette_kmedoids)
k_list_kmedoids=list(k_list_kmedoids)
wccs_kmedoids=list(wccs_kmedoids)
for i in range(len(k_list_kmedoids)):
    print('k :', k_list_kmedoids[i], 'wccs :', wccs_kmedoids[i])
k : 2 wccs : 7416.987925395839
k : 3 wccs : 6603.766393227321
k : 4 wccs : 5541.80770805644
k : 5 wccs : 5066.557978887841
k : 6 wccs : 4642.036475366036
k : 7 wccs : 4432.5813604491195
k : 8 wccs : 4237.701635325206
k : 9 wccs : 4054.699978417469
k : 10 wccs : 3893.710161789086
#k-medoids elbow performance
g1=sns.lineplot(x=k_list_kmedoids,
y=wccs_kmedoids)
g1.set_xlabel('K Kmedoids')
g1.set_ylabel('WCCS Kmedoids')
g1.set_title('K vs WCCS Kmedoids')
Based on the plot above, the elbow bend occurs at k = 4, indicating that the optimal k is 4.
df_final_kmedoids_k_optimal=Kmedoids(4, df_final)
df_final_kmedoids_k_optimal
(     customer_id   recency  frequency  monetary  Cluster
 0         CS1112 -0.228849  -0.605640 -0.359103        3
 1         CS1113 -0.533383   0.357262  0.667091        3
 2         CS1114 -0.568521   0.164682  0.542574        3
 3         CS1115 -0.814491   0.742423  1.029909        1
 4         CS1116  1.434376  -0.990801 -0.691865        0
 ...          ...       ...        ...       ...      ...
 6884      CS8996  0.192814  -0.990801 -1.282248        2
 6885      CS8997  2.113721  -0.798220 -1.365975        0
 6886      CS8998  0.040547  -0.990801 -1.192081        2
 6887      CS8999  2.066870  -1.183381 -1.709471        0
 6888      CS9000 -0.755927  -0.990801 -1.387444        2

 [6889 rows x 5 columns], array([3, 3, 3, ..., 2, 0, 2], dtype=int64))
labels=[0,1,2,3]
plot_scatter_3D(labels, df_final)
after clustering plotting with k = 4
#another plot with sns.pairplot (optional)
sns.pairplot(df_final, vars=['recency','frequency','monetary'],
hue='Cluster', palette='bright')
plt.show()
According to Arthur & Vassilvitskii (2007), k-means++ is an algorithm for choosing the initial values, known as "seeds," for K-Means. It is a way to avoid the poor clusterings that standard K-Means can produce.
Reference: Bashar Aubaidan et al., Journal of Computer Science 10 (7): 1197-1206, 2014.
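The seeding idea can be sketched in a few lines: after the first seed, each subsequent seed is drawn with probability proportional to the squared distance from a point to its nearest already-chosen seed, so far-away points are favoured. The points below are made up, and for determinism the first point stands in for the uniformly random initial seed:

```python
# D^2 seeding sketch (the heart of k-means++): sample the next seed with
# probability proportional to the squared distance to the nearest existing
# seed. Points are made up; the first seed is fixed for determinism
# (in real k-means++ it is chosen uniformly at random).
import numpy as np

rng = np.random.default_rng(0)
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [9.0, 9.0], [9.1, 8.9]])

seed = points[0]
d2 = ((points - seed) ** 2).sum(axis=1)   # squared distance to nearest seed
probs = d2 / d2.sum()                     # D^2 weighting
next_idx = rng.choice(len(points), p=probs)
print(probs.round(4), next_idx)
```

With this layout, almost all of the probability mass sits on the two far-away points, so the second seed lands in the distant cluster, which is exactly the spread that the standard random initialization can miss.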
df_final_kmeans_plus_k4=Kmeans_plus(4, df_final)
df_final_kmeans_plus_k4
(     customer_id   recency  frequency  monetary  Cluster
 0         CS1112 -0.228849  -0.605640 -0.359103        0
 1         CS1113 -0.533383   0.357262  0.667091        0
 2         CS1114 -0.568521   0.164682  0.542574        0
 3         CS1115 -0.814491   0.742423  1.029909        1
 4         CS1116  1.434376  -0.990801 -0.691865        3
 ...          ...       ...        ...       ...      ...
 6884      CS8996  0.192814  -0.990801 -1.282248        2
 6885      CS8997  2.113721  -0.798220 -1.365975        3
 6886      CS8998  0.040547  -0.990801 -1.192081        2
 6887      CS8999  2.066870  -1.183381 -1.709471        3
 6888      CS9000 -0.755927  -0.990801 -1.387444        2

 [6889 rows x 5 columns], array([0, 0, 0, ..., 2, 3, 2]))
labels=[0,1,2,3]
plot_scatter_3D(labels, df_final)
after clustering plotting with k = 4
#another plot with sns.pairplot (optional)
sns.pairplot(df_final, vars=['recency','frequency','monetary'],
hue='Cluster', palette='bright')
plt.show()
wccs_kmeans_plus, k_list_kmeans_plus, silhouette_kmeans_plus=elbow_method_kmeans_plus(df_final)
wccs_kmeans_plus=list(wccs_kmeans_plus)
k_list_kmeans_plus=list(k_list_kmeans_plus)
#wccs result
for i,k in enumerate(k_list_kmeans_plus):
    print('k :', k, 'wccs :', wccs_kmeans_plus[i])
k : 2 wccs : 11182.236337667822
k : 3 wccs : 7874.944569364721
k : 4 wccs : 5925.3574119628465
k : 5 wccs : 4918.354162460282
k : 6 wccs : 4212.8062354511585
k : 7 wccs : 3593.6200670903077
k : 8 wccs : 3267.153692894465
k : 9 wccs : 2991.1224530529153
k : 10 wccs : 2711.5165635231733
g2=sns.lineplot(x=k_list_kmeans_plus,
y=wccs_kmeans_plus)
g2.set_xlabel('K K-Means++')
g2.set_ylabel('WCCS K-Means++')
g2.set_title('K vs WCCS (K-Means++)')
Based on the plot above, the optimal number of clusters for the k-means++ version is likewise 4, since k = 4 is where the elbow bend appears in the curve.
df_final_kmeans_plus_k_optimal=Kmeans_plus(4, df_final)
df_final_kmeans_plus_k_optimal
(     customer_id   recency  frequency  monetary  Cluster
 0         CS1112 -0.228849  -0.605640 -0.359103        0
 1         CS1113 -0.533383   0.357262  0.667091        0
 2         CS1114 -0.568521   0.164682  0.542574        0
 3         CS1115 -0.814491   0.742423  1.029909        1
 4         CS1116  1.434376  -0.990801 -0.691865        3
 ...          ...       ...        ...       ...      ...
 6884      CS8996  0.192814  -0.990801 -1.282248        2
 6885      CS8997  2.113721  -0.798220 -1.365975        3
 6886      CS8998  0.040547  -0.990801 -1.192081        2
 6887      CS8999  2.066870  -1.183381 -1.709471        3
 6888      CS9000 -0.755927  -0.990801 -1.387444        2

 [6889 rows x 5 columns], array([0, 0, 0, ..., 2, 3, 2]))
labels=[0,1,2,3]
plot_scatter_3D(labels, df_final)
after clustering plotting with k = 4
#another plot with sns.pairplot (optional)
sns.pairplot(df_final, vars=['recency','frequency','monetary'],
hue='Cluster', palette='bright')
plt.show()
C. Clustering Performance Evaluated with the Silhouette Coefficient

df_silhoutte_score_comparison=pd.DataFrame({'Solusi Cluster[k]':[2,3,4,5,6,7,8,9,10],
'K-Means':silhouette_kmeans,
'K-Medoids':silhouette_kmedoids,
'K-Means++':silhouette_kmeans_plus})
df_silhoutte_score_comparison
| | Solusi Cluster[k] | K-Means | K-Medoids | K-Means++ |
|---|---|---|---|---|
| 0 | 2 | 0.414116 | 0.414588 | 0.414128 |
| 1 | 3 | 0.408818 | 0.355598 | 0.408720 |
| 2 | 4 | 0.356066 | 0.346026 | 0.355865 |
| 3 | 5 | 0.369397 | 0.352571 | 0.369295 |
| 4 | 6 | 0.329537 | 0.314571 | 0.328776 |
| 5 | 7 | 0.336413 | 0.309939 | 0.336367 |
| 6 | 8 | 0.305661 | 0.275021 | 0.305189 |
| 7 | 9 | 0.306369 | 0.283795 | 0.308197 |
| 8 | 10 | 0.317772 | 0.284839 | 0.317664 |
fig, ax = plt.subplots(figsize=(18,7))
labels=['K-Means','K-Medoids','K-Means++']
g1=sns.lineplot(x='Solusi Cluster[k]',
y='K-Means',
data=df_silhoutte_score_comparison,
ax=ax, color='b',
label=labels[0])
g1.set_xlabel('K')
g1.set_ylabel('Silhouette')
g1.set_title('Silhouette: K-Means vs K-Medoids vs K-Means++')
g1.legend()
g2=sns.lineplot(x='Solusi Cluster[k]',
y='K-Medoids',
data=df_silhoutte_score_comparison,
ax=ax, color='r',
label=labels[1])
g2.legend()
g3=sns.lineplot(x='Solusi Cluster[k]',
y='K-Means++',
data=df_silhoutte_score_comparison,
ax=ax, color='y',
label=labels[2])
g3.legend()
plt.show()
Based on the dataframe above, k = 2 yields the highest silhouette score, so k = 2 is the optimal number of clusters.
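Choosing k this way is simply an argmax over the silhouette scores. A short sketch (the score list below is made up to mirror the shape of the table above):

```python
# Pick the k with the highest average silhouette score.
# The scores below are made up to mimic the comparison table.
import numpy as np

k_values = [2, 3, 4, 5, 6]
silhouette = [0.414, 0.409, 0.356, 0.369, 0.330]

best_k = k_values[int(np.argmax(silhouette))]
print(best_k)  # k = 2 gives the highest silhouette
```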
df_final_kmedoids_k_optimal=Kmedoids(2, df_final)
df_final_kmedoids_k_optimal
(     customer_id   recency  frequency  monetary  Cluster
 0         CS1112 -0.228849  -0.605640 -0.359103        0
 1         CS1113 -0.533383   0.357262  0.667091        1
 2         CS1114 -0.568521   0.164682  0.542574        1
 3         CS1115 -0.814491   0.742423  1.029909        1
 4         CS1116  1.434376  -0.990801 -0.691865        0
 ...          ...       ...        ...       ...      ...
 6884      CS8996  0.192814  -0.990801 -1.282248        0
 6885      CS8997  2.113721  -0.798220 -1.365975        0
 6886      CS8998  0.040547  -0.990801 -1.192081        0
 6887      CS8999  2.066870  -1.183381 -1.709471        0
 6888      CS9000 -0.755927  -0.990801 -1.387444        0

 [6889 rows x 5 columns], array([0, 1, 1, ..., 0, 0, 0], dtype=int64))
labels=[0,1]
plot_scatter_3D(labels, df_final)
after clustering plotting with k = 2
#another plot with sns.pairplot (optional)
sns.pairplot(df_final, vars=['recency','frequency','monetary'],
hue='Cluster', palette='bright')
plt.show()
#Silhouette Plot K Means
from sklearn.metrics import silhouette_samples
import matplotlib.cm as cm
data=df_final.copy()
data=data[['recency','frequency','monetary']].values
for k in range(2,11):
    fig, ax = plt.subplots(figsize=(18,7))
    # The silhouette coefficient can range from -1 to 1, but in this example all
    # values lie within [-0.1, 1]
    ax.set_xlim([-0.1, 1])
    # The (k+1)*10 inserts blank space between the silhouette plots of
    # individual clusters, to demarcate them clearly.
    ax.set_ylim([0, len(data) + (k + 1) * 10])
    clusterer = KMeans(n_clusters=k, init='random', random_state=0)
    cluster_labels = clusterer.fit_predict(data)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(data, cluster_labels)
    print(
        "For n_clusters =",
        k,
        "The average silhouette_score is :",
        silhouette_avg,
    )
    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(data, cluster_labels)
    y_lower = 10
    for i in range(k):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / k)
        ax.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )
        # Label the silhouette plots with their cluster numbers at the middle
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax.set_title("The silhouette plot for the various clusters.")
    ax.set_xlabel("The silhouette coefficient values")
    ax.set_ylabel("Cluster label")
    # The vertical line marks the average silhouette score of all the values
    ax.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax.set_yticks([])  # Clear the yaxis labels / ticks
    ax.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.suptitle(
        "Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
        % k,
        fontsize=14,
        fontweight="bold",
    )
    plt.show()
For n_clusters = 2 The average silhouette_score is : 0.41411565474670786
For n_clusters = 3 The average silhouette_score is : 0.4088184017826584
For n_clusters = 4 The average silhouette_score is : 0.3560659435571489
For n_clusters = 5 The average silhouette_score is : 0.369397306521251
For n_clusters = 6 The average silhouette_score is : 0.3295366813368042
For n_clusters = 7 The average silhouette_score is : 0.3364129259117291
For n_clusters = 8 The average silhouette_score is : 0.3056614916871712
For n_clusters = 9 The average silhouette_score is : 0.30636875420806964
For n_clusters = 10 The average silhouette_score is : 0.3177716654348632
#Silhouette Plot K Medoids
data=df_final.copy()
data=data[['recency','frequency','monetary']].values
for k in range(2,11):
    fig, ax = plt.subplots(figsize=(18,7))
    # The silhouette coefficient can range from -1 to 1, but in this example all
    # values lie within [-0.1, 1]
    ax.set_xlim([-0.1, 1])
    # The (k+1)*10 inserts blank space between the silhouette plots of
    # individual clusters, to demarcate them clearly.
    ax.set_ylim([0, len(data) + (k + 1) * 10])
    clusterer = KMedoids(n_clusters=k, random_state=0)
    cluster_labels = clusterer.fit_predict(data)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(data, cluster_labels)
    print(
        "For n_clusters =",
        k,
        "The average silhouette_score is :",
        silhouette_avg,
    )
    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(data, cluster_labels)
    y_lower = 10
    for i in range(k):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / k)
        ax.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )
        # Label the silhouette plots with their cluster numbers at the middle
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax.set_title("The silhouette plot for the various clusters.")
    ax.set_xlabel("The silhouette coefficient values")
    ax.set_ylabel("Cluster label")
    # The vertical line marks the average silhouette score of all the values
    ax.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax.set_yticks([])  # Clear the yaxis labels / ticks
    ax.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.suptitle(
        "Silhouette analysis for KMedoids clustering on sample data with n_clusters = %d"
        % k,
        fontsize=14,
        fontweight="bold",
    )
    plt.show()
For n_clusters = 2 The average silhouette_score is : 0.41458814656064213
For n_clusters = 3 The average silhouette_score is : 0.3555983157364002
For n_clusters = 4 The average silhouette_score is : 0.3460260086321125
For n_clusters = 5 The average silhouette_score is : 0.35257065850376074
For n_clusters = 6 The average silhouette_score is : 0.31457136889653836
For n_clusters = 7 The average silhouette_score is : 0.30993938640564156
For n_clusters = 8 The average silhouette_score is : 0.27502101516207217
For n_clusters = 9 The average silhouette_score is : 0.2837954918530233
For n_clusters = 10 The average silhouette_score is : 0.28483942816632085
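Perbedaan inti KMedoids dengan KMeans: pusat cluster (medoid) selalu berupa titik data asli yang meminimalkan total jarak ke anggota cluster, bukan rata-rata (mean) yang bisa jatuh di antara titik. Sketsa kecil berikut (bukan implementasi dari library `KMedoids` yang dipakai di atas) mengilustrasikan langkah pemilihan medoid tersebut:

```python
import numpy as np

def medoid(points):
    # Pairwise Euclidean distances between all points in one cluster
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # The medoid is the actual point with the smallest total distance to the rest
    return points[d.sum(axis=1).argmin()]

pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
print(medoid(pts))  # [1. 0.] -- the mean (~[3.67, 0]) is not an actual data point
```

Karena medoid adalah titik data nyata, KMedoids lebih tahan terhadap outlier seperti nilai `monetary` yang ekstrem.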
#Silhouette Plot K Means++
data = df_final.copy()
data = data[['recency','frequency','monetary']].values
for k in range(2, 11):
    fig, ax = plt.subplots(figsize=(18, 7))
    # The silhouette coefficient can range from -1 to 1, but in this example all
    # values lie within [-0.1, 1]
    ax.set_xlim([-0.1, 1])
    # The (k + 1) * 10 inserts blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax.set_ylim([0, len(data) + (k + 1) * 10])
    clusterer = KMeans(n_clusters=k, init='k-means++', random_state=0)
    cluster_labels = clusterer.fit_predict(data)
    # silhouette_score gives the average value over all samples,
    # a perspective on the density and separation of the formed clusters
    silhouette_avg = silhouette_score(data, cluster_labels)
    print(
        "For n_clusters =",
        k,
        "The average silhouette_score is :",
        silhouette_avg,
    )
    # Compute the silhouette score for each sample
    sample_silhouette_values = silhouette_samples(data, cluster_labels)
    y_lower = 10
    for i in range(k):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / k)
        ax.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )
        # Label the silhouette plots with their cluster numbers at the middle
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for the next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax.set_title("The silhouette plot for the various clusters.")
    ax.set_xlabel("The silhouette coefficient values")
    ax.set_ylabel("Cluster label")
    # Vertical line marking the average silhouette score of all samples
    ax.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax.set_yticks([])  # Clear the y-axis labels / ticks
    ax.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.suptitle(
        "Silhouette analysis for KMeans++ clustering on sample data with n_clusters = %d"
        % k,
        fontsize=14,
        fontweight="bold",
    )
    plt.show()
For n_clusters = 2 The average silhouette_score is : 0.4141283277852072
For n_clusters = 3 The average silhouette_score is : 0.4087198588784448
For n_clusters = 4 The average silhouette_score is : 0.3558645675658123
For n_clusters = 5 The average silhouette_score is : 0.36929531195297977
For n_clusters = 6 The average silhouette_score is : 0.3287755294421215
For n_clusters = 7 The average silhouette_score is : 0.33636672040071103
For n_clusters = 8 The average silhouette_score is : 0.30518881816212606
For n_clusters = 9 The average silhouette_score is : 0.30819681669713467
For n_clusters = 10 The average silhouette_score is : 0.31766393187180303
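Ketiga hasil di atas bisa dibandingkan langsung. Sketsa kecil berikut (skor disalin dari output ketiga run di atas, dibulatkan 4 desimal) mencari kombinasi metode dan k dengan silhouette rata-rata tertinggi:

```python
# Average silhouette scores transcribed from the three runs above, for k = 2..10
scores = {
    "KMeans":   [0.4141, 0.4088, 0.3561, 0.3694, 0.3295, 0.3364, 0.3057, 0.3064, 0.3178],
    "KMedoids": [0.4146, 0.3556, 0.3460, 0.3526, 0.3146, 0.3099, 0.2750, 0.2838, 0.2848],
    "KMeans++": [0.4141, 0.4087, 0.3559, 0.3693, 0.3288, 0.3364, 0.3052, 0.3082, 0.3177],
}

# Find the (method, k) pair with the highest average silhouette score
best = max(
    ((method, k, s) for method, vals in scores.items()
                    for k, s in zip(range(2, 11), vals)),
    key=lambda t: t[2],
)
print(best)  # ('KMedoids', 2, 0.4146)
```

Ketiga algoritma sama-sama mencapai silhouette tertinggi pada k = 2, dengan KMedoids sedikit di atas dua lainnya; setelah itu skor cenderung menurun seiring bertambahnya k.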